Abstract

We present a new environment for computations in particle physics phenomenology 
employing recent developments in cloud computing. In this environment users can
create and manage "virtual" machines on which the phenomenology codes/tools can be
deployed easily in an automated way. We analyze the performance of this environment 
based on "virtual" machines versus the utilization of "real" physical hardware. In this 
way we provide a qualitative result for the influence of the host operating system on 
the performance of a representative set of applications for phenomenology calculations. 
1 Introduction



Particle physics is one of the main driving forces in the development of computing and 
data distribution tools for advanced users. Nowadays computations in particle physics phe- 
nomenology take place in a diversified software ecosystem. In a broad sense we can speak in 
terms of two different categories: commercial or proprietary software, and software developed 
by the scientific collaborations themselves. 

Commercial software is distributed under the terms of a particular end-user license agree- 
ment, which defines how and under which circumstances the software is going to be deployed 
and used. In the field of particle physics phenomenology such agreements are undertaken 
by the scientific institutions, which afterwards offer this software as a service to their re- 
searchers. This is the case of the most common software packages employed in the area, 
such as Mathematica, Matlab, etc. 

Scientific collaborations also develop their own software, often as open source under
a copyleft license model. In this way researchers can download this software, use it as it is,
or implement modifications to better suit their particular analysis, following the GNU General
Public License (see footnote 1).

From a technical point of view, most of the codes are developed in Fortran or C/C++.
They tend to be very modular, because typically they are the result of the work of a
collaborative team in which each member is in charge of a particular aspect of the calculation.
Software packages evolve with the need to analyze new data and to simulate new scenarios
at present and future colliders. This evolution implies the inclusion of new modules or
functions, which call and interconnect other modules in the code, and/or make external calls to
proprietary software like Mathematica to perform basic calculations.

The knowledge of the collaboration and the basics of the physics approach often reside
in the core parts of the code, which remain almost unaltered for years but need
to be carried along as the software package is developed to include new
features. The core of the software package acts as a sort of legacy code. New modules
must be added in such a way that these legacy parts remain untouched as much as
possible, because modifying them would affect all modules already present in ways
that are sometimes very difficult to disentangle or to predict. All this
leads to difficulties when these codes have to be compiled together with more modern ones.
Often there are issues with Fortran compilers that cannot be easily solved, and installing
the code requires very deep insight into it.

Some of the codes developed in the framework of scientific collaborations are not open
source, and therefore the sources are closed to external researchers. This reflects
competition between groups, and the fact that often it is in the code where the knowledge
of the group resides, and needs to be protected due to Intellectual Property Rights (IPR). In
such situations the collaboration makes available externally only executable binaries, which
limits the architectures, operating systems, library versions, etc. on which
the codes can be executed.

A further level of integration arises when one needs to deal with complex workflows. This
is a very common scenario in particle physics phenomenology computations: each step of
the calculation requires as input the output of the previous code in the workflow. Therefore,
the installation of several software packages is unavoidable nowadays when, for instance,
the work concerns simulation, prediction or analysis of LHC data/phenomenology. The
installation of several of those software packages on the same machine is often not trivial,
since one needs to install potentially conflicting software: different
libraries for each of these software packages, sometimes even different compiler versions, etc.

1 See http://www.gnu.org/software/gsl/manual/html_node/GNU-General-Public-License.html

The scenario described results in practical difficulties for researchers, which translate into 
time consuming efforts for software deployment, up to impossibility of deployment due to 
software or architecture restrictions. 

There is a general agreement in the community that setting up a proper computing en- 
vironment is becoming a serious overhead for the everyday work of researchers. It is often 
the case that they need to deploy locally in their clusters (or even on their own desktops) 
all the software packages required for the calculations, each of them with their particular 
idiosyncrasies regarding compiler versions, dynamic libraries, etc. In this case the interven- 
tion of cluster system managers is also not of much help because a generic cluster cannot 
accommodate so many options without disturbing the work of everyone, or generating an 
unsustainable work overhead to the system administrator. 

The main idea of this work is to exploit the flexibility of operating system virtualization
techniques to overcome the problems described above. We will demonstrate how the already
available solutions to deploy cloud computing services [1] can simplify the life of researchers
doing phenomenology calculations, and compare the performance to "more traditional"
installations.

As will be shown throughout the article, one obvious solution where virtualization can help
with the problems described above is the deployment of tailored virtual machines fitting
exactly the requirements of the software to be deployed. This is especially the case when one
deals with deploying pre-compiled binaries. However, the work described here aims for a
more complete solution, ranging from user authentication and authorization to automation of
code installation and performance analysis.

We want to remark that virtualization techniques, as a tool for simplifying
the maintenance and deployment of distributed infrastructures, are already used at most
centers involved in the European Grid Infrastructure [2]. However, this work is motivated by
the need to explore a more efficient use of computing resources also at the level of the
end user. Therefore we also want to explore the ways in which a similar cloud service could
be expanded and offered to users of the Grid infrastructures. For this purpose a mechanism
of authentication and authorization based on VOMS [3] has been developed and integrated
as a part of the user service provision model.

Code performance using virtualized resources versus performances on non-virtualized 
hardware is also a subject of debate. Therefore it is interesting to make an efficiency analysis 
for real use cases, including self-developed codes and commercial software in order to shed 
some light on the influence on the performance of the host operating system on virtualized 
environments. 

The hardware employed for all the tests described in the article is a server with 16GB
of RAM (well above the demand of the applications) with four Intel Xeon processors of the
E3-1260L family, 8MB of cache, running at 2.40GHz. In order to have a meaningful
evaluation, we have disabled the power efficiency settings in the BIOS so that the processors
run at maximum constant speed. We have also disabled the Turbo Boost feature in the
BIOS because it increases the frequency of the individual cores depending on their
occupancy, thereby distorting our measurements. For the sake of completeness we have also
evaluated the influence of enabling the Hyperthreading feature of the individual cores,
to demonstrate how virtualization and Hyperthreading together influence the performance
of the codes.

The layout of the article is as follows. In Section 2 we describe the architecture and
implementation of the proposed solutions; Sections 3 and 4 analyze two different real use cases,
together with the respective performance evaluations. The first case focuses on the effects of
the "virtual" environment on single process runs, whereas the second case deals with the
potential speed-up via MPI parallelization on virtualized hardware. The last section contains
our conclusions. The more technical details about authentication and user authorization, as
well as detailed numbers about our comparisons, can be found in the Appendix.



2 Cloud Testbed and Services 
2.1 OpenStack deployment 

The deployment of cloud computing services requires the installation of a middleware on top
of the operating system (Linux in our case) that enables the secure and efficient provision
of virtualized resources in a computing infrastructure. There are several open-source
middleware packages available to build clouds, with OpenNebula [4] and OpenStack [5] being
the most used in the scientific data centers of the European Grid Infrastructure.

After an evaluation of both OpenNebula and OpenStack, we have chosen the latter as
middleware for our deployment due to its good support for our hardware and its modular
architecture, which allows new services to be added without disrupting the already existing
ones, and to scale easily by replicating services. OpenStack has a developer community
behind it that includes over 180 contributing companies and over 6,000 individual members,
and it is being used in production infrastructures like the public cloud at Rackspace (see
footnote 2). Being written in Python is also an advantage, since we can profit from our
expertise in the language to solve problems and extend the features of the system.

OpenStack is designed as a set of inter-operable services that provide on-demand
resources through public APIs. Our OpenStack deployment, based on the Essex release (fifth
version of OpenStack, released in April 2012), has the following services, see Fig. 1:

• Keystone (identity service), provides authentication and access control mechanisms for 
the rest of components of OpenStack. 

• Nova (compute service) manages virtual machines and their associated permanent 
disks (Volumes in OpenStack terminology). The service provides an API to start and 
stop virtual machines at the physical nodes; to assign them IP addresses for network 
connectivity; and to create snapshots of running instances that get saved in the Volume 
area. The volumes can also be used as a pluggable disk space to any running virtual 
machine. 

• Glance (image management service) provides a catalog and repository for virtual disk
images, which are run by Nova on the physical nodes.

• Horizon, a web-based dashboard that provides a graphical interface to interact with
the services.

OpenStack also provides an object storage service, but it is not currently used in our
deployment.

2 See http://www.rackspace.com/cloud/.
Figure 1: OpenStack deployment. Keystone provides authentication for all the services;
Nova provides provisioning of virtual machines and associated storage; Glance manages the
virtual machine images used by Nova; Horizon provides a web-based interface built on top of
the public APIs of the services.



Nova provisions virtual machines on 16 servers of the type described in the introduction. Volume
storage for the virtual machines is provided by two identical servers, each with a quad-core Intel
Xeon E5606 CPU running at 2.13GHz, 3GB of RAM, 4TB of raw disk and two 1Gb
Ethernet interfaces. Glance runs on a server with similar hardware.

The use of open-source software allows us to adapt the services to better suit the needs
of a scientific computing environment: we have extended the authentication of Keystone to
support VOMS and LDAP-based identities, as shown in Appendix A, and we have developed
an image contextualization service with a web interface built on top of Horizon.

2.2 Image contextualization 

In an Infrastructure-as-a-Service cloud, users become the administrators of their machines.
Instead of submitting jobs with their workload to a batch system where the software is
previously configured, they are provided with virtual machines with no extra help from the
resource provider. The installation and configuration of any additional software must be
performed by the final users. This gives users the flexibility to create tailored environments
for running their software, but requires them to perform tedious administrative operations
that are prone to errors and not of interest to most users.

We have developed an image contextualization service that frees users from downloading,
configuring and installing the software required for their computations when the
virtual machine is instantiated. This service has three main components: a contextualizer
that orchestrates the whole process and takes care of application dependencies; an
application catalog that lists all the available applications; and a set of installation scripts
that are executed to download and configure each application. All of them are stored
in a git repository at GitHub (see footnote 3).

The contextualizer makes use of the user-supplied instance meta-data that is specified at
the time of creation of the virtual machine. This is free-form data that is made available
to the running instance through a fixed URL. In our case, the contextualizer expects to find
a JSON dictionary with the applications to install on the machine.

The application catalog is a JSON dictionary, where each application is described with 
the following fields: 

• app_name: human readable application name, for showing it at user interfaces. 

• base_url: download URL for the application. 

• file: name of the file to be downloaded, relative to the base_url. 

• dependencies: list of applications that need to be installed before this one. 

• installer: name of the contextualization script that installs the application. 

• versions: dictionary containing the different available versions of the application. 
Inside this dictionary, there is an entry for each version where at least a version_name 
entry specifies a human readable name for the version. Optionally, it may include 
any of the fields in the application description, overriding the default values for the 
application. 

The only mandatory fields are installer and versions. A sample entry is shown
below. The application name in this case is FormCalc, and it depends on the FeynHiggs
application (see footnote 4). There are two different versions, 7.0.2 and 7.4, with the first
one overriding the default value for the base_url:

    "FormCalc": {
        "app_name": "FormCalc",
        "dependencies": [
            "FeynHiggs"
        ],
        "installer": "feyntools.sh",
        "base_url": "http://www.feynarts.de/formcalc/",
        "versions": {
            "7.0.2": {
                "base_url": "https://devel.ifca.es/~enol/feynapps/",
                "app_version": "7.0.2"
            },
            "7.4": {
                "app_version": "7.4"
            }
        }
    }

3 See https://github.com/enolfc/feynapps for details.
4 See Sect. 3.2 for more details on these codes.
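The override rule for version entries can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the function name resolve_version is hypothetical, and the actual contextualizer in the feynapps repository may implement the merge differently.

```python
import json

# Catalog entry taken from the sample above.
CATALOG_JSON = """
{
  "FormCalc": {
    "app_name": "FormCalc",
    "dependencies": ["FeynHiggs"],
    "installer": "feyntools.sh",
    "base_url": "http://www.feynarts.de/formcalc/",
    "versions": {
      "7.0.2": {
        "base_url": "https://devel.ifca.es/~enol/feynapps/",
        "app_version": "7.0.2"
      },
      "7.4": {"app_version": "7.4"}
    }
  }
}
"""


def resolve_version(catalog, app, version):
    """Merge application defaults with the overrides of one version.

    Hypothetical helper: fields in the version entry take precedence
    over the application-level defaults.
    """
    entry = catalog[app]
    defaults = {key: value for key, value in entry.items() if key != "versions"}
    return {**defaults, **entry["versions"][version]}


catalog = json.loads(CATALOG_JSON)
old_release = resolve_version(catalog, "FormCalc", "7.0.2")
new_release = resolve_version(catalog, "FormCalc", "7.4")
print(old_release["base_url"])  # overridden by the version entry
print(new_release["base_url"])  # falls back to the application default
```

Keeping the defaults at application level means that a new version usually needs only a one-line entry in the catalog.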

When a new virtual machine is started, it pulls the latest changes from the git repository,
thus having an up-to-date description of the applications and the scripts. The contextualizer
then fetches the instance meta-data and, for each application listed, executes the installation
script, passing as parameters the application information from the catalog. This script will
download the application and perform the steps required to install it (configuration,
compilation, etc.). If the application has any dependencies, the contextualizer will install
them first, taking care to avoid duplicated installations. The use of the git repository for
managing our contextualization service allows us to update the tools without having to recreate
the virtual machine images.
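The dependency handling described above amounts to a depth-first traversal of the catalog. The following Python sketch shows one possible way to obtain an installation order in which dependencies come first and no application is installed twice. The function name install_order and the FeynHiggs entry are assumptions made for illustration; the real contextualizer may proceed differently.

```python
def install_order(catalog, requested):
    """Return the applications to install, dependencies first, no duplicates."""
    order = []
    done = set()

    def visit(app, stack):
        if app in done:
            return  # already scheduled, avoid a duplicated installation
        if app in stack:
            raise ValueError("circular dependency involving " + app)
        for dependency in catalog[app].get("dependencies", []):
            visit(dependency, stack + (app,))
        done.add(app)
        order.append(app)

    for app in requested:
        visit(app, ())
    return order


# Minimal catalog; the FeynHiggs entry is hypothetical.
catalog = {
    "FeynHiggs": {"installer": "feyntools.sh", "versions": {}},
    "FormCalc": {
        "installer": "feyntools.sh",
        "dependencies": ["FeynHiggs"],
        "versions": {},
    },
}

# The user asks for both codes; FeynHiggs is scheduled only once.
print(install_order(catalog, ["FormCalc", "FeynHiggs"]))
```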

To ease the use of the service, we have also extended the OpenStack dashboard to offer
the contextualized instances from a web-based graphical interface. Fig. 2 shows this
contextualization panel in Horizon. The panel is a modified version of the instance launch panel,
where a new tab includes the option to select which applications to install. The tab is
created on the fly by reading the application catalog from a local copy of the git repository on
the Horizon machine; changes in the application catalog are made available with a periodic
pull of the repository. The panel includes each selected application in the instance
meta-data, which will be used in turn by the contextualizer to invoke the scripts.
Figure 2: Image contextualization panel in Horizon. For each available application in the
catalog, the user can select which version to install. 
The panel restricts the images that can be instantiated to those that are ready to start the
contextualization on startup, which are identified in Glance by the property feynapps
set to true. This avoids errors due to selection of incorrect images and facilitates the addition
of new images in the future without changing the dashboard.
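The image restriction is essentially a filter on one image property. A minimal Python sketch is given here, working on plain dictionaries rather than on the Glance API; the image names and the layout of the dictionaries are hypothetical.

```python
def contextualizable_images(images):
    """Return the names of images whose 'feynapps' property is set to true."""
    selected = []
    for image in images:
        flag = image.get("properties", {}).get("feynapps", False)
        # Properties may be stored as strings, so compare case-insensitively.
        if str(flag).lower() == "true":
            selected.append(image["name"])
    return selected


# Hypothetical image list as a dashboard might see it.
images = [
    {"name": "linux-feynapps", "properties": {"feynapps": "true"}},
    {"name": "plain-linux", "properties": {}},
    {"name": "old-image", "properties": {"feynapps": "false"}},
]

print(contextualizable_images(images))
```

Because the selection depends only on the property, a new image becomes available in the panel as soon as it is tagged, without any change to the dashboard code.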

In order to test the performance of the virtual systems, we compare execution times in 
the above described set-up with the execution time in the real (physical) machine, consisting 
of a server as described in the introduction. In this case, the various codes are directly, 
manually installed and executed on that machine. 

3 Use Case: single processes on virtual machines 

The first use case analyzed here concerns the evaluation of the decay properties of (hypo- 
thetical) elementary particles. The description of the underlying physics will be kept at a 
minimum; more details can be found in the respective literature. 

3.1 The physics problem 

Nearly all high-energy physics results are described with highest accuracy by the
Standard Model (SM) of particle physics [6]. Within this theory it is possible to calculate
the probabilities of elementary particle reactions. A more complicated theory that tries to
go beyond the SM (to answer some questions the SM cannot address well) is Supersymmetry
(SUSY), where the simplest realization is the Minimal Supersymmetric Standard Model
(MSSM) [7]. Within this theory all particles of the SM possess "SUSY partner particles".
The physics problem used in our single-process example concerns the calculation of the
decay probabilities of one of these SUSY partner particles, the so-called "heaviest
neutralino", which is denoted as $\tilde\chi^0_4$.

In the language of the MSSM the two decay modes investigated here are

$\tilde\chi^0_4 \to \tilde\chi^0_1 \, h_1$ ,   (1)

$\tilde\chi^0_4 \to \tilde\chi^+_1 \, W^-$ .   (2)

Here $\tilde\chi^0_1$ denotes the dark matter particle of the MSSM, $h_1$ is a Higgs boson, $W^-$ is a SM
particle responsible for nuclear decay, and $\tilde\chi^+_1$ is a corresponding SUSY partner. More details
can be found in Ref. [8].

The evaluation is split into two parts. The first part consists of the derivation of
analytical formulas that depend on the free parameters of the model. These parameters are
the masses of the elementary particles as well as various coupling constants between them.
These formulas are derived within Mathematica [9] and are subsequently translated into
Fortran code; the second part consists of the evaluation of the Fortran code, see below.
Numerical values are given to the free parameters (masses and couplings) and in this way
the decay properties for (1), (2) are evaluated. In the case of (2) this also includes
an additional numerical integration in four-dimensional space-time, which is performed
by the Fortran code. However, no qualitative differences have been observed, and we will
concentrate solely on process (1) in the following.
3.2 The computer codes and program flow

In the following we give a very brief description of the computer codes involved in our analy- 
sis. Details are not relevant for the comparison of the different implementations. However, it 
should be noted that the codes involved are standard tools in the world of high-energy physics 
phenomenology and can be regarded as representative cases, permitting a valid comparison 
of their implementation. 

The first part of the evaluation is done within Mathematica [9] and consequently will
be called "Mathematica part" in the following. It uses several programs developed for the
evaluation of the phenomenology of the SM and MSSM. The corresponding codes are

• FeynArts [10]: this Mathematica based code constructs the "Feynman diagrams" and
"amplitudes" that describe the particle decay processes (1) and (2).

• FormCalc [11]: this Mathematica based code takes the "amplitudes" constructed by
FeynArts and transforms them into analytical formulas in Fortran. For intermediate
evaluations, FormCalc also requires the installation/use of Form [12], which is
distributed as part of the FormCalc package.

• LoopTools [11]: this Fortran based code provides four-dimensional (space-time)
integrals that are required for the evaluation of the decay properties.

Not all parameters contained in the analytical formulas are free, i.e. independent param- 
eters. The structure of the SM and the MSSM fixes several of the parameters in terms of 
the others. At least one additional code is required to evaluate the dependent parameters in 
terms of the others, 

• FeynHiggs [13]: this Fortran based code provides the predictions of the Higgs particles
(such as $h_1$ in Eq. (1)) in the MSSM.

The program flow of the Mathematica part is as follows. A steering code in Mathematica
calls FeynArts and initiates the analytical evaluation of the decay properties of reaction (1)
or (2). In the second step the steering code calls FormCalc for further evaluation. After the
analytical result within Mathematica has been derived, FormCalc generates a Fortran code
that allows for the numerical evaluation of the results. The code LoopTools is linked to this
Fortran code, and so is FeynHiggs. Once the creation of
the Fortran code finishes, the Mathematica part terminates. The results of these analytical
evaluations for the particle processes under investigation, as well as for many similar cases
(which used the same set of codes), have been verified to give reliable predictions [13].

The second part of the evaluation is based on Fortran and consequently will be denoted
as "Fortran part" in the following. It consists of the execution of the Fortran code created in
the Mathematica part. One parameter of the model is scanned in a certain interval, whereas
all other parameters of the model are kept fixed. The calculation of the decay properties is
performed for each value of the varied parameter. To be definite, in our numerical examples
we have varied the complex phase of one of the free parameters, $\varphi_{M_1}$, between 0° and 360°
in steps of one degree. In each of the 361 steps two parameter configurations are evaluated.
Thus, in total the Fortran part performs 722 evaluations of the decay properties. As a
physics side remark, the results are evaluated automatically in an approximate way (called
"tree") and in a more precise way (called "full"). The results of the Fortran part are written
into an ASCII file. As an example of this calculation we show in Fig. 3 the results for the
decay (1) for the two parameter configurations called $S_g$ and $S_h$ (both "tree" and "full")
as a function of the parameter that is varied, $\varphi_{M_1}$. More details about the physics can be
found in Ref. [8].
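The count of evaluations follows directly from the scan layout: both end points of the interval are included, which gives 361 phase values, each evaluated for two parameter configurations. A short Python check of this arithmetic is given here; the variable names are illustrative and do not correspond to the Fortran code.

```python
# Scan of the phase from 0 to 360 degrees in steps of one degree,
# with both end points included.
phases_deg = list(range(0, 361))

# The two parameter configurations named in the text.
configurations = ("Sg", "Sh")

# One evaluation of the decay properties per (phase, configuration) pair.
evaluations = [(phi, conf) for phi in phases_deg for conf in configurations]

print(len(phases_deg), len(evaluations))
```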
Figure 3: Example output of the evaluation of the properties of decay (1) for the two
parameter configurations, $S_g$ and $S_h$, in the approximation ("tree") and the more precise
way ("full") as a function of $\varphi_{M_1}$ [8]. The decay property $\Gamma$ is given in its natural units
"GeV" (giga-electronvolt).
3.3 Performance analysis

We have measured the performance of the calculation of decay processes (1) and (2) in a
virtualized environment. As our set-up we have instantiated virtual machines as described in
Sect. 2, including the necessary computational packages, among them Mathematica, FeynArts,
FormCalc and FeynHiggs, see above.

Since the nature of the codes is quite different, the computational time has been measured
separately for the Mathematica part of the computation and for the Fortran part of the code,
which involves basically floating-point computing (i.e. without the load on file handling
and input/output).

In order to fix our notation we introduce the following abbreviations: 

• S_HT,nHT(c) denotes a virtual machine consisting of c cores and 2GB of RAM.

• M_HT,nHT(c) denotes a virtual machine consisting of c cores and 4GB of RAM.

• L_HT,nHT(c) denotes a virtual machine consisting of c cores and 7GB of RAM.

• XL_HT,nHT(c) denotes a virtual or physical machine with c cores and 14GB of RAM.

The subscripts HT and nHT refer to Hyperthreading enabled or disabled on the virtual
machine, respectively. For instance, M_HT(2) denotes a virtual machine with two physical
cores, Hyperthreading enabled (i.e. 4 logical cores) and 4GB of RAM.

3.3.1 Single process on multicore virtual machines 

In our first test we submit a single process to the system (regardless of how many cores are
available). We plot in Fig. 4 the time that only the Mathematica part of the code takes, as
a function of the configuration of the machine employed.

As we see, the Mathematica part is hardly affected by the size of the machine, once
the virtual machine is large enough. The effect observed with S_HT(1) is an overhead due to
the extra work that the only core needs to do to handle both Mathematica and the guest
operating system. Hyperthreading is not enough to overcome the penalty in performance if
only one core is used. However, when more than one core is available one can see a constant
performance regardless of the size of the virtual machine, and also regardless of whether
Hyperthreading is enabled or not.

We have also included in this figure the comparison with the time it takes on the XL_HT(8)
machine without virtualization, which is called the "physical machine". We see that the physical
machine is only slightly faster, by about 1%. The degradation of performance in this case is
therefore minimal. A more detailed comparison of virtual and real machines can be found
below.

Figure 4: Execution time in seconds of the Mathematica part. One single process has been
started on the different virtual machine configurations (S_HT(1), M_HT(2), L_HT(4),
XL_HT(8), M_nHT(2), L_nHT(4)). The execution time on the equivalent real physical
machine has been included for comparison for XL_HT(8). The corresponding
detailed numbers can be found in Tabs. 1, 3.

Results turn out to be qualitatively different in the analysis of the Fortran part of the code,
as can be seen in Fig. 5. This part is dominated by floating-point calculations, with little
input/output or file handling. The first difference appears already at the smaller machines,
where we no longer observe overheads due to the size of the virtual machine. The
second difference with respect to the Mathematica part of the code is that enabling
Hyperthreading does imply a performance penalty of the order of 4%. This is to be expected on
general grounds due to the performance caveats induced by Hyperthreading on floating-point
dominated applications, coming from the fact that the cores are not physical but
logical, and the FPU is the same physical one for the two logical cores.

As for the comparison with the physical machine without virtualization, again shown for
XL_HT(8), we see that virtualization has degraded performance by about 3%, which is still
a very small impact. Thus the influence of the host operating system is very small in low
load situations.
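The percentages quoted here are relative differences between the execution time on the virtual machine and on the physical machine. A minimal Python sketch of this figure of merit follows; the timings used in the example are hypothetical placeholders, not the measured values, which are listed in the Appendix.

```python
def overhead_percent(t_virtual, t_physical):
    """Relative slowdown of the virtual machine with respect to the physical one."""
    if t_physical <= 0:
        raise ValueError("the reference time must be positive")
    return 100.0 * (t_virtual - t_physical) / t_physical


# Hypothetical timings in seconds, chosen only to illustrate a 3% penalty.
example = overhead_percent(t_virtual=103.0, t_physical=100.0)
print(example)
```

A negative value would indicate that the virtual machine was faster than the physical one.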

For both parts of the evaluation, the Mathematica part and the Fortran part, the
percentage of system time employed during the computations is negligible. For the Mathematica
dominated part of the computation it starts at 3% in S_HT(1) and decreases to 1.5%
in the rest of the series. In the Fortran part it stays constant at about 0.2%.

3.3.2 Multiple simultaneous processes on multicore virtual machines 

In this section we investigate the behavior of the performance in virtual machines under
high load circumstances. For that we use a machine with 4 physical cores and Hyperthreading
enabled, thus 8 logical cores.

Figure 5: Execution time in seconds of the Fortran part. One single process has been started
on the different virtual machine configurations (S_HT(1), M_HT(2), L_HT(4), XL_HT(8),
M_nHT(2), L_nHT(4)). The execution time on the equivalent real
physical machine has been included for comparison for XL_HT(8). The corresponding detailed
numbers can be found in Tabs. 1, 3.

To fix the notation we have adapted the previous definition as follows (in this test
Hyperthreading is always enabled, therefore we drop the subscript for simplicity):

• M(c/p) denotes a virtual machine consisting of c cores and 4GB of RAM and p
concurrent processes running.

• L(c/p) denotes a virtual machine consisting of c cores and 7GB of RAM and p
concurrent processes running.

• XL(c/p) denotes a virtual or real machine with c cores and 14GB of RAM and p
concurrent processes running.

The test was performed as follows. First we instantiate a virtual machine with a number
of logical cores c. Then we start from p = 1 up to p = c simultaneous processes in order to
fill all the logical cores available, and measure how long each of the simultaneous processes
takes to complete. Since not all the simultaneous processes take the same time to complete,
we have taken the time of the slowest one for the plots. Conservatively speaking, this is
the real time that the user would have to wait. The difference between the maximum and
minimum times is not significant for our analysis (see Tabs. 4, 5 in Appendix B for more
details on actual times).

In Fig. 6 we plot the execution time in seconds of the Mathematica part of the code for
the M, L and XL machines with various numbers of processes as described above. In the
XL case, for comparison, we also show the execution time on the real physical machine. The
first observation is that the degradation of the performance appears only when we load the
system with more processes than the existing physical cores (i.e. more than 4). Thus we
conclude that this is not an effect of virtualization, but rather of Hyperthreading. In the
comparison of the virtual and the real machines, shown for XL(8/n) in Fig. 6, one can see
that virtualization does not really imply a penalty on the performance.

Figure 6: Execution time in seconds of the parts of the calculation involving Mathematica.
The execution time on the equivalent real physical machine has been included for comparison.
The corresponding detailed numbers can be found in Tabs. 4, 5.

An interesting effect in this comparison can be observed when submitting p = 6 or more
simultaneous processes. Counterintuitively, the physical machine execution time is larger
than the virtual machine execution time. This can only be explained if the virtualized
operating system manages to handle the threads better than the normal operating system,
which relies only on Hyperthreading to distribute the system load.

To investigate this effect we plot in Fig. 7 the percentage of system time that the
operating systems spent on the runs. We can see that at XL(8/6) the physical machine
spends less system time than expected and, indeed, it is not managing the load of the 6
processes on the 8 logical cores in the most efficient way. In this case the spread in execution
time between the fastest and the slowest process is very large (1899 seconds versus 2572
seconds, where the former is faster than the fastest time on the virtual machine, 2359 seconds),
see Tabs. 4 and 5 in Appendix B.
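As a rough illustration of the quantity discussed here, the share of system time can be computed from the tabulated real and sys times. The definition used below (sys time divided by real time) is an assumption made for this sketch; the exact normalization behind Fig. 7 is not spelled out in the text. The function name `sys_share` is hypothetical, and the inputs are the Mathematica values for the slowest process at 6 processes on 8 logical cores (Tabs. 4 and 5).

```python
def sys_share(real_time, sys_time):
    """Percentage of the wall-clock (real) time spent in system mode.

    Assumption: normalization to the real time; other conventions
    (e.g. sys / (user + sys)) would give slightly different numbers.
    """
    return 100.0 * sys_time / real_time


# Slowest Mathematica process with 6 concurrent processes on 8 logical cores.
vm_share = sys_share(real_time=2385.7, sys_time=62.8)        # XL_HT(8/6), virtual
physical_share = sys_share(real_time=2572.4, sys_time=29.4)  # R_HT(8/6), physical

print(f"virtual: {vm_share:.2f}%  physical: {physical_share:.2f}%")
```

Under this assumption the physical machine spends a smaller fraction of its time in system mode than the virtual one.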

To conclude we plot in Fig. 8 the equivalent execution times in the Fortran dominated
part of the calculation. We see that essentially the same pattern of behavior is reproduced:
the load of the machines has a sizable effect on the execution time only for more than
4 simultaneous processes, and the virtual and real machines show negligible differences.

Figure 7: Percentage of system time employed by the virtual machine in the Mathematica
part. The same percentage on the equivalent real physical machine has been included for
comparison in the XL case. The corresponding detailed numbers can be found in Tabs. 4 and 5.

Figure 8: Execution time in seconds of the Fortran part. The execution time on the
equivalent real physical machine has been included for comparison for the XL case. The
corresponding detailed numbers can be found in Tabs. 4 and 5.
4 Use Case: MPI Parallelization



The second use case analyzed here concerns a parameter scan as a typical application in the
field of high-energy physics phenomenology. It also constitutes a perfect example of a task
that can be easily parallelized, see below for more details. For each point in the parameter
scan an evaluation of the Higgs boson properties that depend on this parameter choice is
performed. As in the previous section, the description of the underlying physics will be kept
to a minimum, and more details can be found in the respective literature.

4.1 The physics problem 

This physics problem is also taken from the MSSM. This model possesses several free
parameters. Since they are unknown, a typical analysis within this model requires
extensive parameter scans, where the predictions for the LHC phenomenology change with
the set of the scanned parameters.

After the discovery of a Higgs-like particle at the LHC [15, 16] the Higgs bosons of the
MSSM are naturally of particular interest. The most relevant free parameters of the MSSM
in this respect are

$M_A$ and $\tan\beta$ . (3)

$M_A$ denotes the mass of a Higgs particle in the MSSM, $\beta$ is a "mixing angle", see Ref. [17]
for further details.

A typical question for a choice of parameters is whether this particular combination
of parameters is experimentally allowed or forbidden. A parameter combination, in our
case a combination of $M_A$ and $\tan\beta$, can result in predictions for the Higgs particles that
are in disagreement with experimental measurements. Such a parameter combination is
called "experimentally excluded". In the example we are using, two experimental results are
considered. The first are the results from the LHC experiment itself. The other set are the
results from a previous experiment, called "LEP" [18].

4.2 The computer codes and program flow 

In the following we give a very brief description of the computer codes involved in this anal- 
ysis. Details are not relevant for the comparison of the various levels of parallelization. As in 
the previous example, it should be noted that the codes involved constitute standard tools 
in the world of high-energy physics phenomenology and can be regarded as representative 
cases, permitting a valid comparison of their implementation. 

The main code that performs the check of a Higgs prediction with results from the LHC 
and LEP is 

• HiggsBounds [19]: this Fortran based code takes input for the model predictions from
the user and compares it to the experimental results that are stored in the form of 
tables (which form part of the code). 

The predictions for the Higgs phenomenology are obtained with the same code used in
the previous section,

• FeynHiggs [13]: this Fortran based code provides the predictions of the Higgs particles
in the MSSM.

In our implementation a short steering code (also in Fortran) contains the initialization
of the parameter scan: two loops over the scan parameters, $M_A$ and $\tan\beta$, are performed in
the ranges (omitting physical units),

$M_A = 90 \ldots 500$ ,
$\tan\beta = 1.1 \ldots 60$ , (4)

with 120 steps in each parameter, resulting in 14400 scan points. As a physics side remark:
the other free parameters are set to fixed values, in our case according to the $m_h^{\max}$ scenario
described in Ref. [17]. However, details are not relevant for our analysis.
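The two nested loops of the steering code can be pictured with the following minimal sketch. It is written in Python purely for illustration (the actual steering code is in Fortran), and it assumes that the 120 steps are equally spaced and include both end points of each range; the text does not specify the spacing. The helper name `linspace` and the variable names are hypothetical.

```python
def linspace(start, stop, n):
    """Return n equally spaced values from start to stop, end points included."""
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]


N_STEPS = 120                                  # steps per parameter, as in the text
MA_VALUES = linspace(90.0, 500.0, N_STEPS)     # M_A range of Eq. (4)
TB_VALUES = linspace(1.1, 60.0, N_STEPS)       # tan(beta) range of Eq. (4)

# Outer loop over M_A, inner loop over tan(beta): one entry per scan point.
SCAN_POINTS = [(ma, tb) for ma in MA_VALUES for tb in TB_VALUES]

print(len(SCAN_POINTS))
```

Each entry of `SCAN_POINTS` corresponds to one independent evaluation of the Higgs predictions.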

The steering code calls the code HiggsBounds, handing over the scan parameters. In- 
ternally HiggsBounds is linked to FeynHiggs, again handing over the scan parameters. 
FeynHiggs performs the prediction of the Higgs phenomenology, and the results are given 
back to HiggsBounds. With these parameters the code can now evaluate whether this 
parameter combination is allowed or disallowed by existing experimental results. The
corresponding results are stored in simple ASCII files, where one file contains the points
excluded by the LHC and another file the points excluded by LEP. As an example, we show in
Fig. 9 the results for this scan in the two-dimensional $M_A$-$\tan\beta$ plane. Points marked in
red, according to the evaluation with HiggsBounds/FeynHiggs, are in disagreement with
experimental results from the LHC, and blue points are in disagreement with experimental
results from LEP. White points are in agreement with the currently available experimental
results.

4.3 MPI parallelization 

The parameter scan performed by the code is a typical example of an embarrassingly parallel 
computation, where each parameter evaluation can be computed independently of the others, 
without requiring any communication between them. Problems of this kind can be easily
parallelized by dividing the parameter space into sets and assigning them to the available
processors. An OpenMP [20] parallelization was discarded due to the use of non-thread-safe
libraries in the code, so we opted for MPI [21] to develop the parallel version of
the code.

In the parallel version, the steering code in Fortran was modified to have a single process 
that initializes the computation by setting the number of steps (by default 120 steps in each 
parameter) and values for the fixed free parameters and broadcasting all these values to the 
other processes in the computation. The parameter space is then divided equally among all 
processes, which perform the evaluation and write their partial results to independent files 
without any further communication between processes. Once the computation finishes, the 
partial result files are merged into a single file with all results. A master/worker
parallelization with dynamic assignment of the parameters to each worker was not considered
because the execution time per evaluation is almost constant, hence there is no need to
balance the work load between the workers.
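The static division of the parameter space described above can be sketched as a contiguous block decomposition. This is an illustrative Python sketch and not the Fortran/MPI steering code itself: the function name `block_range` is hypothetical, and in an actual MPI program the values of `rank` and `n_procs` would be obtained from the communicator rather than passed in by hand.

```python
def block_range(n_points, n_procs, rank):
    """Return the half-open index range [start, stop) of scan points for one process.

    Points are split into contiguous blocks whose sizes differ by at most one,
    so no communication is needed once every process knows n_points.
    """
    base, extra = divmod(n_points, n_procs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


# 14400 scan points distributed over 8 processes: 1800 points each.
for rank in range(8):
    start, stop = block_range(14400, 8, rank)
    print(rank, start, stop)
```

Because the cost per point is almost constant, equal block sizes already give a balanced load, which is the reason given in the text for not using a dynamic master/worker scheme.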
Figure 9: Example output of the MSSM scan in the two free parameters $M_A$ and $\tan\beta$.
The parameter $M_A$ is given in its natural units "GeV" (Giga electron Volt).



4.4 Performance analysis 

We have measured the scalability and performance of the two-dimensional $M_A$-$\tan\beta$ plane
scan described in Section 4.2 with 14400 scan points in a virtualized environment. As in
the previous case, we have instantiated the virtual machines using our contextualization
mechanism to install the FeynHiggs and HiggsBounds packages. The MPI code was compiled
with Open MPI v1.2.8 [22] as provided in the Operating System distribution.

These tests were performed on virtual machines that use the complete physical machine,
with and without Hyperthreading enabled (8 or 4 logical cores, respectively), and on the
equivalent physical machine with the same number of cores and RAM, in order to compare
the performance without virtualization.

We plot in Fig. 10 the execution time for the parameter scan using from 1 process (serial
version) up to the number of cores available in each machine. The times for the parallel
versions also include the final merge of the partial result files.

As we see, the performance degradation due to virtualization is minimal, below 5% for
all executions, and the difference in execution time with and without HyperThreading for
the same number of processes is negligible. The difference between the virtual and physical
machine decreases as the number of processes grows above 4. This effect, also seen in the case
of multiple processes in Section 3, is due to the different management of the HyperThreading
cores by the virtualized Operating System.

Figure 10: Execution time in seconds of the application for different numbers of processes,
both in virtual and physical machines with and without HyperThreading.

Since there is no communication overhead in the implementation, the application is expected
to scale linearly with the number of processes given equally powerful CPUs. As seen in the
plot, the scalability of the application is almost linear up to 4 processes (the same number
of processes as available physical cores) and it flattens as the Operating System uses the
logical cores provided by the HyperThreading.
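The scaling behaviour discussed above is commonly quantified through the speedup and the parallel efficiency. The symbols $T(p)$, $S(p)$ and $E(p)$ are introduced here only for illustration and are not part of the original notation; $T(p)$ stands for the execution time of the scan with $p$ processes.

```latex
S(p) = \frac{T(1)}{T(p)} , \qquad E(p) = \frac{S(p)}{p} .
```

Ideal linear scaling corresponds to $S(p) = p$ and $E(p) = 1$; a flattening of the execution time beyond the 4 physical cores shows up as $E(p)$ dropping below one for $p > 4$.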



5 Conclusions 

We have described a new computing environment for particle physics phenomenology that 
can easily be translated to other branches of science. It is based on "virtual machines" , using 
the OpenStack infrastructure. In view of the performance, it is necessary to distinguish 
between two questions: the benefits that virtualization brings to researchers in terms of
accessibility to computing resources, and the question of code performance and in general
penalties due to the host operating system.

Regarding the first question, the setup of OpenStack and the development of the
self-instantiation mechanism have been clearly appreciated by the researchers doing this type
of computations. The solution removes many of the barriers described in the introduction of
this article regarding complex code installation, machine availability, and automation of
workflows.

An additional benefit of this set-up is that OpenStack allows the user to take snapshots of
the virtual machine, which are stored in a repository, and which the owner of the snapshot
can instantiate again at any moment, recovering the session as they saved it. This is a very
practical feature because it allows researchers to "save" the current status of the virtual
machine, and continue working at any other moment without blocking the hardware in the
meantime.

The second question is performance. We have analyzed a set of representative codes in 
the area of particle physics phenomenology, so that our results can be extrapolated to similar 
codes in the area. The results are very positive, as no excessive penalty due to virtualization
can be observed. At most we observe degradations in performance on the order of 3% for 
the parts of the codes dominated by Floating Point Calculations. For other calculations the 
degradation was even less. We have furthermore analyzed the influence of system time in the 
virtual machines. We found that the virtualization has no significant impact on the system 
time. 

Evidently, the possibility of accessing resources in a more flexible way and the time that
researchers save on software configuration when using the new environment largely
compensate for the small performance penalty of using virtualized resources for the codes
under investigation.
Appendix A: User access mechanisms

In this Appendix we describe the user access mechanisms that we have implemented. 
A.I Authentication 

Keystone performs the validation of user's credentials (username/password) using a con- 
figurable back-end, which also stores all the data and associated meta-data about users, 
tenants (groups in OpenStack terminology) and roles associated to each user. There are 
four back-ends provided in the default OpenStack distribution: Key-Value Store (KVS),
which provides an interface to other stores that can support primary key lookups (like a
hash table); SQL, which stores data in a persistent database using SQLAlchemy; LDAP, which
uses an LDAP server where users and tenants are stored in separate subtrees; and PAM, for
simple mapping of local system users to OpenStack. The typical deployment uses the SQL
back-end.

Additional authentication mechanisms, e.g. those not based on username/password cre- 
dentials, can be implemented using the Pluggable Authentication Handler (PAH) mechanism. 
A pluggable authentication handler analyzes each user request and, if it's able to handle it, 
authenticates the user before the default back-end is executed. 
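The control flow of such a handler chain can be sketched as follows. This is a schematic illustration only and does not reproduce the Keystone API: the class `RemoteUserHandler`, the functions `default_backend` and `authenticate`, and the dictionary-based request are all hypothetical names chosen for this sketch.

```python
class RemoteUserHandler:
    """Hypothetical handler: accepts requests already authenticated upstream."""

    def authenticate(self, request):
        # Returning None signals "this handler cannot deal with the request".
        return request.get("REMOTE_USER")


def default_backend(request, users):
    """Stand-in for the username/password check of the default back-end."""
    name = request.get("username")
    if name in users and users[name] == request.get("password"):
        return name
    return None


def authenticate(request, handlers, users):
    """Try every pluggable handler first, then fall back to the default back-end."""
    for handler in handlers:
        user = handler.authenticate(request)
        if user is not None:
            return user
    return default_backend(request, users)


handlers = [RemoteUserHandler()]
print(authenticate({"REMOTE_USER": "alice"}, handlers, {}))
print(authenticate({"username": "bob", "password": "pw"}, handlers, {"bob": "pw"}))
```

The point of the design is that a request carrying non-password credentials never reaches the default back-end if one of the handlers recognizes it.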
Scientific computing and data centers provide services to scientific collaborations that
often go beyond a single institution, creating federations of resources where users can access
the different providers using a single identity.

Grid infrastructures base the authentication and authorization of users on X.509 certifi- 
cates and Virtual Organizations (VO). A Virtual Organization comprises a dynamic set of 
individuals (or institutions) defined around a set of resource-sharing rules and conditions. 
The current pan-European grid infrastructure uses the Virtual Organization Management
System (VOMS) for managing the users and resources within each VO. VOMS provides signed
assertions regarding the attributes of a user belonging to a VO, thus enabling providers to
trust these assertions and to define access rules to their resources based on the attributes.
While a broad community of users is already familiar with these authentication
mechanisms, the use of X.509 certificates and proxies is considered to be one of the main
barriers for new users and communities.

There are other mechanisms to provide identity federation across multiple providers that
do not require the use of certificates, but these are poorly supported by the current
scientific collaborations for accessing computing resources. LDAP can also be used to support
federated authentication. LDAP is a well-known solution for authentication and it is widely
used within the scientific data centers. However, the LDAP back-end included in OpenStack
enforces a particular schema that does not fit most existing deployments.

We have extended the authentication capabilities of Keystone with two new Pluggable 
Authentication Handlers, one supporting VOMS and another supporting LDAP authentica- 
tion with arbitrary schema for storing user information. These modules enable the creation 
of a federated cloud infrastructure where users have a single identity across the different 
resource providers. The VOMS module enables the re-use of grid identity systems and
leverages the existing experience of the resource providers, while the LDAP module enables
the creation of simple federations without the inconvenience of the X.509 certificates for new
users and communities.

A.II VOMS Authentication 

The VOMS authentication module is implemented as a Pluggable Authentication Handler
in Keystone and executed as a WSGI [23] application in an httpd server enabled to use
OpenSSL and configured to accept proxy certificates (VOMS assertions are included in proxy
certificates). Fig. 11 shows the call sequence for the authentication and authorization of a
user using VOMS.

Prior to the authentication, the user creates a proxy by contacting the VOMS server. This 
proxy includes the distinguished name (DN) of the user and a set of attributes related to the 
VO (group and roles associated with the user). The authentication in Keystone is performed
by requesting a token from the server; if successful, Keystone returns a token which is used
for any subsequent calls to the other OpenStack services. In our implementation, the user
authenticates against the httpd server with the VOMS proxy, and the server, after validation,
includes the SSL information in the request environment. This information reaches the
VOMS module, which authorizes the request by checking whether the proxy is valid and
whether the VO is allowed in the server.

Figure 11: VOMS Authentication sequence diagram.

Once the proxy is considered valid and allowed, the module maps VO attributes included
in the proxy to a local OpenStack tenant using a configuration file. The user DN in the
proxy is used as user name in Keystone. The mapped local tenant must exist in advance
for a user to be authenticated. The VOMS module can automatically create the user in
Keystone if enabled in the configuration. This makes it easy to establish access policies based
only on membership of a given VO, instead of giving access to individuals. Once a user
has been granted access, the administrator can manage it as with any other user in the
Keystone back-end (i.e. disable/enable, grant/revoke roles, etc.).
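The mapping and auto-creation logic just described can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual module: the VO name, tenant name, DN and the function `authorize` are hypothetical, and the mapping is held in a plain dictionary instead of being read from a configuration file.

```python
# Hypothetical VO-to-tenant mapping; in the real module this comes from a configuration file.
VO_TO_TENANT = {"pheno.example.org": "pheno"}


def authorize(dn, vo, existing_tenants, known_users, autocreate=True):
    """Map a validated proxy (user DN plus VO) to a (user, tenant) pair, or None.

    The tenant must already exist; the user is created on first access only
    when autocreate is enabled.
    """
    tenant = VO_TO_TENANT.get(vo)
    if tenant is None or tenant not in existing_tenants:
        return None
    if dn not in known_users:
        if not autocreate:
            return None
        known_users.add(dn)  # the DN itself serves as the user name
    return dn, tenant


users = set()
print(authorize("/DC=org/DC=example/CN=Jane Doe", "pheno.example.org", {"pheno"}, users))
```

Access is thus granted by VO membership alone, while the created user can afterwards be managed like any other user.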

A.III LDAP Authentication

The LDAP authentication module takes advantage of the authentication features of the
Apache httpd server. In this case the authentication phase is delegated to the server, which
can authenticate users against LDAP with arbitrary schemas. This way, we use reliable
and tested code that is used in production systems and is actively maintained by the Apache
developers. Moreover, we avoid the introduction of security risks by minimizing the code that
deals with the authentication. Once the server has authenticated the request, it will pass
the user name through the environment of the WSGI module.

The LDAP module uses a configuration file that specifies, using regular expressions, which
users are allowed and to what tenants they should be mapped in the system. If the user is
authorized, the user can be created automatically in Keystone as in the VOMS case. Finally,
the token is returned to the user. Figure 12 shows the sequence diagram for this process.
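A regular-expression based mapping of this kind could look like the sketch below. The rules, user names and tenant names are hypothetical examples, and the rule list stands in for the configuration file, whose actual format is not described in the text.

```python
import re

# Hypothetical rules: (pattern for allowed user names, tenant they are mapped to).
RULES = [
    (re.compile(r"pheno_[a-z0-9]+"), "pheno"),
    (re.compile(r"admin"), "operations"),
]


def tenant_for(username):
    """Return the tenant of the first rule whose pattern matches the whole user name."""
    for pattern, tenant in RULES:
        if pattern.fullmatch(username):
            return tenant
    return None  # user is not allowed


print(tenant_for("pheno_jdoe"))
print(tenant_for("guest"))
```

Using `fullmatch` rather than a prefix match avoids accidentally admitting user names that merely start with an allowed pattern.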
Figure 12: LDAP Authentication sequence diagram.



Appendix B: Detailed computation times 



In this Appendix we give detailed numbers on the execution times of the analysis in Sect. 3.
The notations are the same as in that section.



machine                Math                      Fortran                      Total
(cores)       real     user     sys       real     user     sys       real     user     sys
S_HT(1)    1661.06  1601.87   48.38    5500.06  5466.54   11.07    7161.24  7068.43   59.47
M_HT(2)    1617.06  1613.56   28.37    5480.03  5465.36   12.20    7097.20  7078.99   40.58
L_HT(4)    1617.05  1615.18   29.38    5475.64  5461.08   12.28    7092.80  7076.33   41.68
XL_HT(8)   1617.19  1615.86   31.00    5475.95  5461.20   12.48    7093.25  7077.12   43.50

Table 1: Computation time (sec) of virtual machines with hyperthreading (HT), divided into
Mathematica part, Fortran part, and total time. Computation time is divided into real, user,
sys, denoting respectively real time, processor time for computation, and system time.
machine                Math                      Fortran
(cores)       real     user     sys       real     user     sys
M_nHT(2)   1616.78  1613.12   28.20    5281.21  5266.46   11.69
L_nHT(4)   1618.9   1613.9    33.8     5278.9   5264.0    12.6

Table 2: Computation time (sec) of virtual machines without hyperthreading (nHT).



machine                Math                      Fortran
(cores)       real     user     sys       real     user     sys
R_HT(8)     1605.1   1584.8    13.9     5306.9   5290.8     6.6

Table 3: Computation time (sec) of real "physical" machine R with hyperthreading (HT).



machine                      Math                      Fortran
(cores/proc.)       real     user     sys       real     user     sys
M_HT(2/1)         1617.1   1613.6    28.4     5480.0   5465.4    12.2
M_HT(2/2) max     1708.9   1647.2    52.6     5500.7   5472.7    12.2
M_HT(2/2) min     1713.4   1650.6    53.4     5493.6   5469.7    12.2
L_HT(4/1)         1617.1   1615.2    29.4     5475.6   5461.1    12.3
L_HT(4/2) max     1678.6   1653.2    39.5     5488.0   5473.3    12.4
L_HT(4/2) min     1672.4   1656.4    36.1     5491.0   5476.0    12.6
L_HT(4/4) max     1771.1   1709.0    54.5     5602.4   5580.7    12.3
L_HT(4/4) min     1775.3   1711.8    53.8     5511.9   5489.8    12.7
XL_HT(8/1)        1617.2   1615.9    31.0     5476.0   5461.2    12.5
XL_HT(8/2) max    1678.4   1671.2    34.4     5492.4   5477.1    13.1
XL_HT(8/2) min    1676.8   1668.6    35.3     5493.4   5478.2    13.0
XL_HT(8/4) max    1807.0   1790.9    39.7     5566.1   5549.4    13.9
XL_HT(8/4) min    1809.6   1786.3    45.1     5521.6   5504.8    14.0
XL_HT(8/6) max    2385.7   2333.8    62.8     7706.1   7684.6    17.2
XL_HT(8/6) min    2358.8   2306.8    63.0     7558.8   7535.5    17.2
XL_HT(8/8) max    2835.5   2741.8    78.2     9375.0   9344.9    18.7
XL_HT(8/8) min    2818.9   2728.8    77.5     9329.9   9295.4    18.6

Table 4: Computation time (sec) of virtual machines with multiple equal processes.
machine                      Math                      Fortran
(cores/proc.)       real     user     sys       real     user     sys
R_HT(8/1)         1605.1   1584.8    13.9     5306.9   5290.8     6.6
R_HT(8/2) max     1683.7   1642.9    15.7     5345.4   5329.3     6.5
R_HT(8/2) min     1686.9   1646.2    17.1     5311.7   5295.5     6.6
R_HT(8/4) max     1763.4   1696.1    20.6     5401.6   5384.9     6.9
R_HT(8/4) min     1765.6   1700.4    19.9     5317.6   5301.0     6.9
R_HT(8/6) max     2572.4   2449.5    29.4     8063.2   8039.8     8.5
R_HT(8/6) min     1899.1   1804.9    23.4     7060.7   7039.5     8.0
R_HT(8/8) max     2882.7   2707.6    48.8     9320.7   9288.5    10.0
R_HT(8/8) min     2862.3   2685.2    47.7     9291.8   9261.2     9.6

Table 5: Computation time (sec) of real machine R with HT with multiple equal processes.